1  Introduction


This paper introduces methods of applying the Network Discriminant Function (NDF) to training
and automatic architecture selection of feed-forward multiple layer perceptrons (MLPs) used
for Pattern Recognition (PR). Knowledge of the NDF and of the MLP clustering ability may
be used to guide MLP architecture choices as well as give researchers an important new
handle on how to evaluate MLP performance. In the past, the sum of the square error
(SSE) has been the main metric available to researchers. The NDF gives a new
insight into MLP performance in that it reflects how well the MLP is clustering the data. If
clustering is an intrinsic capability of the MLP which makes it perform well, then the NDF
should correlate with other performance measures such as the SSE. Architecture choices and
training techniques need to be selected so as to aid the MLP clustering ability. The impact
of architecture choices based on the NDF is an area which has not yet received attention in
the literature.

The major goals of the research are to establish the NDF as a criterion useful for
architecture selection and to relate and compare the NDF to other criteria such as
Cascade-Correlation's error correlation. The Background section discusses MLPs and pattern
recognition and in particular discusses 1 out of N networks, linear last layer networks,
Cascade-Correlation (CC) networks, and functional-link networks. Next, application of the NDF is
discussed. The NDF-Cascade (NDFC) learning architecture and the MLP NDF are
proposed. A series of tests are described, test results are given, and each test is discussed
individually. Then, the results are discussed and new suggestions for research are made.
Finally, the research is summarized.


2  Background 

PR  is  a  well  established  field  of  study  with  a  rich  body  of  techniques  and  theories.  The 
amount  of  research  on  artificial  neural  networks  (ANNs)  has  grown  in  the  past  several  years. 
This is evident from the creation of new research journals such as the IEEE Transactions
on  Neural  Networks  and  Neural  Networks.  The  areas  of  sensor  processing,  control,  and 
data  analysis  have  successfully  applied  neurocomputing  to  hard  problems  [1].  Perhaps  the 
most  significant  success  has  occurred  within  the  area  of  PR.  The  relationship  of  ANNs 
and  PR  systems  is  synergistic  [2].  PR  gives  theoretical  basis  for  functions  that  need  to 
be  performed  while  ANNs  address  many  of  the  hardest  problems  that  PR  systems  have  in 
practice.  ANNs  do  not  solve  all  the  problems  of  PR  systems  but  they  are  improving  PR 
capabilities.  The  lack  of  understanding  of  exactly  what  mapping  ANNs  are  performing  will 
slow  their  acceptance.  This  is  especially  true  for  critical  process  situations.  Engineers  need 
to  know  that  the  PR  system  is  performing  logically  and  may  want  to  modify  what  it  is 
doing. Automatic selection of architecture requirements such as number of layers, number
of  nodes  in  a  particular  layer,  and  type  of  node  functions  is  important  to  allow  for  more 
flexible  systems  and  to  expand  usage  to  lesser  skilled  users.  These  goals  require  greater 
understanding  of  the  MLPs  and  how  the  architecture  impacts  their  performance. 

2.1  Dividing  the  Decision  Space 

MLPs can divide the decision space [3]. A single node provides a boundary in space. A
linear node provides a linear decision boundary while a nonlinear node provides a nonlinear
decision boundary. Adding additional linear layers provides no benefit, as the weight
matrices can simply be multiplied to provide an equivalent single layer network. This points
out an inherent limitation of linear MLPs: they only work well on linearly separable
problems. Nonlinear nodes make the addition of more layers useful. The second layer combines
the decision boundaries of the first layer to provide decision regions while the third layer
combines the regions into areas. In this way the nonlinear MLP allows the decision space
to be further divided based on decision boundaries created in the first layer. A classic
example is the exclusive-or problem, which is not linearly separable yet a three layer MLP
can successfully learn it. This added ability of nonlinear MLPs may be attributed to the
fact that, unlike linear MLPs, they can increase the rank of the data [4]. Webb and Lowe
experimentally demonstrated this by showing improved class separation of a nonlinear
hidden layer with higher-dimensional space. They suggest that this is because nonlinear MLPs
can increase rank into a space where discrimination is easier to perform. As data rank is
a linear measure, a hard and strict rule for determining layer widths as a function of data
rank seems unlikely. Given that the MLP can divide the decision space, the next step is to
derive desirable decision boundaries.
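
The collapse of stacked linear layers can be checked numerically. The following is a minimal sketch, not part of the original experiments: the weight matrices `W1` and `W2` hold arbitrary illustrative values, bias terms are omitted for brevity, and the helper names are hypothetical. It shows that two linear layers applied in sequence give the same output as one layer whose weight matrix is their product.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_layer(w, x):
    """Output of a linear layer with weight matrix w for input vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


# Illustrative weights only (assumed values, not from the paper).
W1 = [[1.0, -2.0], [0.5, 3.0], [2.0, 1.0]]   # first layer: 2 inputs -> 3 nodes
W2 = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.5]]     # second layer: 3 nodes -> 2 outputs
W_single = matmul(W2, W1)                    # equivalent single 2 x 2 layer

x = [0.7, -1.2]
two_layer_out = apply_layer(W2, apply_layer(W1, x))
one_layer_out = apply_layer(W_single, x)
print(two_layer_out, one_layer_out)
```

Because the two outputs agree for any input, depth adds nothing unless a nonlinear node function is placed between the layers.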

The MLPs considered here are trained so as to minimize the sum of square error. Let $t_p$
and $o_p$ be the pth desired target pattern vector and actual output vector, respectively.
Assume that each node has a weighted connection to a node with fixed output 1. This allows
bias thresholds to be treated implicitly. As the input vectors are given, $o_p$ is a function of
only the interconnection weights. The sum square error (SSE), E, over a given training set
may then be expressed as:

$$E = \sum_{p=1}^{P} d_p \, \lVert t_p - o_p \rVert^2 \qquad (1)$$

where $d_p$ is a constant which weights the error for pattern p. $o_p$ is a function of the weights, so E is
also a function of the weights which may be minimized by adjusting the weights. This is the
well known least squares or autoregression problem. One interpretation this gives MLPs is
that they are simply mapping functions which are curve fitted to a set of data points, i.e.,
the training set pairs.
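
As a small illustration of the pattern-weighted SSE, the sketch below sums the squared differences between target and output vectors, scaled by a per-pattern weight. The function name `weighted_sse` and the example vectors are assumptions made for this illustration; a weight of 1 for every pattern is assumed when no weights are supplied.

```python
def weighted_sse(targets, outputs, pattern_weights=None):
    """Sum over patterns of d_p * ||t_p - o_p||^2.

    targets, outputs: lists of equal-length vectors, one per pattern.
    pattern_weights:  optional list of d_p values (defaults to all ones).
    """
    if pattern_weights is None:
        pattern_weights = [1.0] * len(targets)
    total = 0.0
    for d_p, t_p, o_p in zip(pattern_weights, targets, outputs):
        total += d_p * sum((t - o) ** 2 for t, o in zip(t_p, o_p))
    return total


# Hypothetical two-pattern, two-output example.
targets = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[0.8, 0.1], [0.3, 0.6]]
print(weighted_sse(targets, outputs))               # unweighted error
print(weighted_sse(targets, outputs, [2.0, 1.0]))   # first pattern counts double
```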

2.2  The 1 out of N MLP


Of particular interest to this research is the subset of MLPs which use 1 out of N encoding,
which is useful for multi-class problems. Each output corresponds to a particular class. A
1 indicates that the input vector belongs to the class while a 0 indicates that it does not. Given
mutually exclusive classes, only 1 out of the N output nodes should be 1 while all others are
0. The outputs of 1 out of N encoded MLPs have been proved to approximate the optimal Bayes
discriminant functions [5]. Each output is a minimum mean squared-error estimate of the
corresponding class discriminant function, which is the probability of the class given the
input. The proof applies to any technique which attempts to minimize the mean squared
error when the desired outputs are 0 and 1. It does not depend on what type of nodes are
used or the architecture of the network. Each output represents the a posteriori probability
for the corresponding class. Knowing that the output is the a posteriori probability allows
selection of sensible decision thresholds on the outputs of MLPs. The quality of the
approximation does depend on a good sampling of the data and an architecture capable of
accurate approximation of the a posteriori probabilities. Importantly, nonlinear MLPs can
perform any mapping [6], so any probability distribution can be modeled by MLPs.
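
The link between 1 out of N targets and class probabilities can be seen in a degenerate case. The sketch below is an illustration under stated assumptions, not the proof cited in [5]: when several training patterns share the same input, the output that minimizes the squared error against their 0/1 targets is the mean target, which is the observed class frequency. The helper names are hypothetical.

```python
def one_of_n(label, n_classes):
    """Target vector with a 1 at the class index and 0 elsewhere."""
    return [1.0 if k == label else 0.0 for k in range(n_classes)]


def least_squares_output(labels, n_classes):
    """Best constant output for patterns that share one input.

    The mean of the 1 out of N targets minimizes the squared error,
    and it equals the observed frequency of each class.
    """
    targets = [one_of_n(y, n_classes) for y in labels]
    return [sum(t[k] for t in targets) / len(targets) for k in range(n_classes)]


def decide(outputs):
    """Pick the class whose output is largest."""
    return max(range(len(outputs)), key=lambda k: outputs[k])


# Four hypothetical patterns with identical inputs but different classes.
estimate = least_squares_output([0, 0, 1, 2], n_classes=3)
print(estimate, decide(estimate))
```

Here the estimate is the vector of class frequencies, and taking the largest output selects the most probable class.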


2.3  Linear  Last  Layer  MLPs 

Other authors have investigated 1 out of N encoded MLPs which use linear nodes in the
output layer [4, 7, 8, 9, 10, 11]. For linear MLPs, E is known to have a unique minimum
corresponding to the projection onto the subspace generated by the first principal vectors
of a covariance matrix associated with the training patterns [9]. Gallinari [7] considers a
linear case with two linear layers. The first layer was proven mathematically to perform
a discriminant analysis projection while the second layer performs a classification on the
outputs of the hidden units. For MLPs with multiple nonlinear layers ending with a linear
layer, the MLP has been experimentally demonstrated to map data into smaller and smaller
clusters which are easier to separate as the data maps from layer to layer. Webb [4] proves
that the outputs of the hidden layer into the last linear layer maximize the
NDF:

$$J_N = \mathrm{Tr}\{ S_B S_T^{+} \} \qquad (2)$$

where:

$$S_T = \sum_{p=1}^{P} (h_p - m_H)(h_p - m_H)^T \qquad (3)$$

$$S_B = \sum_{k=1}^{N} n_k^2 \, (m_H^k - m_H)(m_H^k - m_H)^T \qquad (4)$$

where N is the number of classes, P is the number of training patterns, $n_k$ is the number of
patterns in class k, $h_p$ is the output vector of the last hidden layer for pattern p, $m_H$ is the mean
output vector at the hidden units over all patterns, and $m_H^k$ is the mean output vector at
the hidden units for class k patterns. $S_B$ is similar to the between-class scatter matrix of
Duda and Hart [12] while $S_T^{+}$ is the pseudo-inverse of the total unnormalized class covariance $S_T$,
which is also known as the total scatter matrix. Weights are chosen so as to maximize the
NDF in the space spanned by the outputs of the final hidden layer. The part of the network
up through the output of the hidden layer clusters the classes based on the NDF while the
final layer performs an optimal mapping onto the targets. So the MLP optimizes a specific
feature extraction and performs classification simultaneously.
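
The NDF can be evaluated directly from a set of hidden-layer outputs and class labels. The sketch below is one possible pure-Python implementation, written under the assumption that the total scatter is unnormalized and that each class mean deviation is weighted by the squared class size. Instead of forming the pseudo-inverse explicitly, it solves a linear system for each class, which gives the same trace because every class mean deviation lies in the range of the total scatter matrix. The function names and the tolerance are assumptions of this sketch.

```python
def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def solve_consistent(a, b, tol=1e-9):
    """Return one solution of a x = b, with free variables set to 0.

    Gauss-Jordan elimination with partial pivoting. Assumes the system is
    consistent, which holds when b lies in the range of the matrix a.
    """
    n = len(a)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    pivots, r = [], 0
    for c in range(n):
        if r == n:
            break
        p = max(range(r, n), key=lambda i: abs(m[i][c]))
        if abs(m[p][c]) < tol:
            continue                      # no pivot here: free variable
        m[r], m[p] = m[p], m[r]
        pivot = m[r][c]
        m[r] = [v / pivot for v in m[r]]
        for i in range(n):
            if i != r:
                f = m[i][c]
                m[i] = [vi - f * vr for vi, vr in zip(m[i], m[r])]
        pivots.append((r, c))
        r += 1
    x = [0.0] * n
    for row, col in pivots:
        x[col] = m[row][n]
    return x


def ndf(hidden, labels):
    """Tr{S_B S_T^+} for hidden-layer output vectors and class labels."""
    dim = len(hidden[0])
    m_all = mean_vector(hidden)
    s_t = [[0.0] * dim for _ in range(dim)]
    for h in hidden:
        d = [hi - mi for hi, mi in zip(h, m_all)]
        for i in range(dim):
            for j in range(dim):
                s_t[i][j] += d[i] * d[j]
    total = 0.0
    for k in sorted(set(labels)):
        members = [h for h, y in zip(hidden, labels) if y == k]
        d = [a - b for a, b in zip(mean_vector(members), m_all)]
        x = solve_consistent(s_t, d)      # equals S_T^+ d up to a null-space term
        total += len(members) ** 2 * sum(di * xi for di, xi in zip(d, x))
    return total


# 3-bit parity: raw inputs, inputs plus a three-input AND, and an ideal 1-D clustering.
bits = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
labels = [(a + b + c) % 2 for a, b, c in bits]
raw = [[float(v) for v in x] for x in bits]
with_and3 = [r + [float(x[0] * x[1] * x[2])] for r, x in zip(raw, bits)]
clustered = [[float(y)] for y in labels]
print(ndf(raw, labels), ndf(with_and3, labels), ndf(clustered, labels))
```

Under these assumptions the three values come out close to 0, 1, and 4, which agrees with the NDF values quoted for the 3-bit parity problem in Section 4.2.1.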


2.4  NDF  Adjustment 

Lowe [8] discusses pattern error weighting (choosing $d_p$) and target coding. Weights and
target assignment values can be selected so as to directly compensate for uneven class
distributions. The cost of assigning a class incorrectly can also be incorporated by selection
of the target vector (i.e., the output is something other than just 1s and 0s), but this requires a
post-process decision on the output vector. The way MLPs with a linear output layer
perform is understood well enough to allow modification of the discriminant functions. In
particular, the ability to include the cost of wrong classification is important. For example,
identifying a friendly fighter as an enemy fighter could lead to unnecessary fratricide, so this
cost may be considered in network design. Output target vector values and
pattern error weights can be selected logically based on the selection of a desirable NDF.


2.5  Fisher's  Discriminant 

A well known discriminant used in PR is Fisher's discriminant [12, 13]:

$$J_F = \frac{|S_B|}{|S_W|} \qquad (5)$$

where $S_B$ is the standard between-class scatter matrix:

$$S_B = \sum_{k=1}^{N} n_k \, (m_H^k - m_H)(m_H^k - m_H)^T \qquad (6)$$

and $S_W$ is the within-class scatter matrix:

$$S_W = S_T - S_B \qquad (7)$$

The determinant is a measure of the volume spanned by the points, so increasing the
determinant corresponds to increasing the volume and, thus, the distance between points. Thus,
maximizing Fisher's discriminant corresponds to increasing $|S_B|$, the distance between
classes, while decreasing $|S_W|$, the distance between points in a class. The NDF $S_B$ is
similar to Fisher's $S_B$ but it weights the larger classes more heavily.
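
The difference in class weighting is easiest to see with a single feature, where the scatter matrices and determinants reduce to scalars. The sketch below is an illustrative one-dimensional special case with made-up data; the function name and the dictionary keys are assumptions. It computes both versions of the between-class scatter and checks that the within-class scatter equals the total scatter minus Fisher's between-class scatter.

```python
def scatter_1d(classes):
    """Scalar scatter quantities for one feature.

    classes: dict mapping a class label to its list of feature values.
    Assumes at least one class has spread, so the within-class scatter is nonzero.
    """
    all_values = [v for values in classes.values() for v in values]
    m = sum(all_values) / len(all_values)
    s_t = sum((v - m) ** 2 for v in all_values)
    s_b_fisher = 0.0
    s_b_ndf = 0.0
    for values in classes.values():
        n_k = len(values)
        m_k = sum(values) / n_k
        s_b_fisher += n_k * (m_k - m) ** 2        # weight n_k
        s_b_ndf += n_k ** 2 * (m_k - m) ** 2      # weight n_k squared
    s_w = s_t - s_b_fisher
    return {
        "S_T": s_t,
        "S_B_fisher": s_b_fisher,
        "S_B_ndf": s_b_ndf,
        "S_W": s_w,
        "J_F": s_b_fisher / s_w,
        "J_N": s_b_ndf / s_t,
    }


# Hypothetical unbalanced data: three points in class "a", one in class "b".
result = scatter_1d({"a": [1.0, 2.0, 3.0], "b": [6.0]})
print(result)
```

In this example the large class contributes 3 to Fisher's between-class scatter but 9 to the NDF version, while the single-point class contributes 9 to both.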

2.6  Cascade-Correlation  and  Functional-Link 

For a more complete description of Cascade-Correlation (CC), see Fahlman's report [14].
CC is a relatively new architecture generating algorithm which uses error correlation to
build up to the final network. The learning process begins with two layers. The first layer
consists of the input and the bias node while the other is simply the output layer. The
network is trained. If the network does not work, then a new unit is created which takes
input from all other nodes except for the output layer nodes. Thus, the new unit becomes
a new layer in the network. The new unit is trained to maximize its output correlation to
the error:

$$S = \sum_{o=1}^{N} \left| \sum_{p=1}^{P} (V_p - \bar{V})(E_{p,o} - \bar{E}_o) \right| \qquad (8)$$

where $V_p$ and $E_{p,o}$ are respectively the candidate unit's output value for pattern p and the
current network error for pattern p at output o, and where $\bar{V}$ and $\bar{E}_o$ are respectively the
average of the candidate unit's output and output o's error. Thus, the unit becomes a
feature detector for problematic patterns. Once the new unit is added, the weights into
the output layer are retrained and the learning process repeats until the network works
successfully. Instead of only one unit, a pool of units with the same or varying functions
can be used. The unit with the maximum error correlation is selected. This increases the
odds of getting a good unit.
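
The error correlation score can be computed directly from a candidate unit's outputs and the residual errors of the current network. The sketch below follows the definition given above; the function name `cc_correlation` and the example values are assumptions made for illustration.

```python
def cc_correlation(candidate_outputs, errors):
    """Sum over outputs of |sum over patterns of (V_p - mean V)(E_po - mean E_o)|.

    candidate_outputs: list of V_p, one value per pattern.
    errors:            list of per-pattern error vectors, errors[p][o].
    """
    n_patterns = len(candidate_outputs)
    v_bar = sum(candidate_outputs) / n_patterns
    score = 0.0
    for o in range(len(errors[0])):
        e_bar = sum(errors[p][o] for p in range(n_patterns)) / n_patterns
        covariance = sum(
            (candidate_outputs[p] - v_bar) * (errors[p][o] - e_bar)
            for p in range(n_patterns)
        )
        score += abs(covariance)
    return score


# Hypothetical residual errors for four patterns and two outputs.
errors = [[0.5, -0.5], [-0.5, 0.5], [0.5, -0.5], [-0.5, 0.5]]
tracking_unit = [0.0, 1.0, 0.0, 1.0]   # output follows the error pattern
constant_unit = [0.3, 0.3, 0.3, 0.3]   # output never changes
print(cc_correlation(tracking_unit, errors), cc_correlation(constant_unit, errors))
```

A unit whose output varies with the residual error scores high, while a constant unit scores zero, which is why a pool of candidates is ranked by this value.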

The cascade architecture generated by CC can be considered to be an instantiation of
the functional-link net discussed by Pao [15]. In the functional-link concept only two layers
are ever used. The first layer consists of the inputs and functions on the inputs. Pao showed
that the right set of functions leads to networks that train easily to good solutions. CC
simply uses error correlation as a basis for selecting new functions of the original inputs to
be added to the first layer.

3  Applying  the  NDF 

Does the NDF have properties which make it a good search criterion for functions or for aiding
weight learning? For linear last layer MLPs, the SSE is minimized when the NDF is maximized. The
NDF appears to be the measure for how an MLP clusters data as it maps through the layers.
Intuitively, the NDF measures how separable the data is. A bigger NDF is better, as the
SSE learning seeks to maximize it. Therefore, the NDF is a basic property which may be
useful in training and automatic architecture selection. For example, some units and their
functions may aid the clustering process better than others and, thus, lead to MLPs which
train quicker as well as perform better. The NDF may be useful for selecting these units
from a pool of units or actually growing new units as in CC. The NDF may be useful as
another criterion in the gradient search cost function. These ideas are applied in the following
algorithm proposals.

3.1  Unit  Selection 

In the functional link concept presentation by Pao [15], units were simply the possible
combinations of the original inputs going into "and" gates. These units could be put into a pool
of units where an algorithm could consider each for selection. In this case, the NDF could
be used. Initially, only the original inputs are given. For each unit in the pool, calculate its
outputs for each pattern. Train the net and check if the solution is good. If not, select a
unit from the pool. For each unit, calculate the NDF assuming its output is added to the
current input pattern. Remove the unit with the maximum NDF from the pool and add
it permanently to the pattern. Repeat the training and selection process until the network
gives a good solution or the pool is emptied.
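
The selection loop described above can be sketched as follows. This is a simplified illustration rather than the implementation used in the tests: the training step and the check for a good solution are omitted, so the loop simply runs until the pool is empty and records the NDF reached after each addition. The NDF is assumed to be the trace of the class-size-squared between-class scatter times the pseudo-inverse of the unnormalized total scatter, and the pool below assumes the four "and" units of the 3-bit problems. All function names are hypothetical.

```python
from itertools import product


def solve_consistent(a, b, tol=1e-9):
    """One solution of a x = b by Gauss-Jordan elimination (free variables = 0)."""
    n = len(a)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    pivots, r = [], 0
    for c in range(n):
        if r == n:
            break
        p = max(range(r, n), key=lambda i: abs(m[i][c]))
        if abs(m[p][c]) < tol:
            continue
        m[r], m[p] = m[p], m[r]
        pivot = m[r][c]
        m[r] = [v / pivot for v in m[r]]
        for i in range(n):
            if i != r:
                f = m[i][c]
                m[i] = [vi - f * vr for vi, vr in zip(m[i], m[r])]
        pivots.append((r, c))
        r += 1
    x = [0.0] * n
    for row, col in pivots:
        x[col] = m[row][n]
    return x


def ndf(features, labels):
    """Tr{S_B S_T^+} with S_B weighted by squared class sizes."""
    dim, count = len(features[0]), len(features)
    m_all = [sum(f[i] for f in features) / count for i in range(dim)]
    s_t = [[sum((f[i] - m_all[i]) * (f[j] - m_all[j]) for f in features)
            for j in range(dim)] for i in range(dim)]
    total = 0.0
    for k in set(labels):
        members = [f for f, y in zip(features, labels) if y == k]
        d = [sum(f[i] for f in members) / len(members) - m_all[i] for i in range(dim)]
        x = solve_consistent(s_t, d)
        total += len(members) ** 2 * sum(di * xi for di, xi in zip(d, x))
    return total


def select_units(inputs, labels, pool):
    """Repeatedly move the pool unit giving the largest NDF into the pattern."""
    features = [[float(v) for v in x] for x in inputs]
    remaining = dict(pool)
    order = []
    while remaining:
        scores = {
            name: ndf([f + [float(fn(x))] for f, x in zip(features, inputs)], labels)
            for name, fn in remaining.items()
        }
        best = max(scores, key=scores.get)
        fn = remaining.pop(best)
        features = [f + [float(fn(x))] for f, x in zip(features, inputs)]
        order.append((best, scores[best]))
    return order


inputs = list(product((0, 1), repeat=3))
parity = [sum(x) % 2 for x in inputs]
pool = {
    1: lambda x: x[0] & x[1],
    2: lambda x: x[0] & x[2],
    3: lambda x: x[1] & x[2],
    4: lambda x: x[0] & x[1] & x[2],
}
for unit, value in select_units(inputs, parity, pool):
    print(unit, round(value, 2))
```

In a fuller version the loop would stop as soon as the trained network classifies every pattern correctly, as the procedure above specifies.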

3.2  The NDF-Cascade Learning Architecture

The  NDFC  approach  is  similar  to  CC.  Instead  of  choosing  units  which  maximize  error 
correlation,  units  are  selected  to  maximize  the  NDF.  Based  on  the  requirements  leading  to 
the  NDF,  the  last  layer  consists  of  linear  nodes  and  1  out  of  N  encoding  is  used.  As  the 
last  layer  is  linear,  weights  into  the  output  layer  can  be  directly  calculated  using  standard 
linear least squares parameter calculations [16].

3.3  The  NDF  MLP 

The  NDF  MLP  approach  selects  MLP  weights  using  the  cost: 

$$C = E - K J_N \qquad (9)$$

where  K  is  the  NDF  cost  weight.  As  minimizing  SSE  implies  maximizing  the  NDF,  the 
global  minimum  should  be  the  same  but  the  derivative  information  may  be  different  and, 
thus,  lead  to  learning  good  solutions  that  minimizing  SSE  by  itself  would  not  find. 

4  Tests  and  Results 

A number of tests were performed which explore the NDF and its usefulness. For test
purposes, a network was deemed to have a good result when, for each pattern, the correct
class output is maximum.

Unit Number   Unit Function
1             $i_0 \wedge i_1$
2             $i_0 \wedge i_2$
3             $i_1 \wedge i_2$
4             $i_0 \wedge i_1 \wedge i_2$

Table 1: Unit Numbers for 3-Bit Problems

4.1  NDF vs. Fisher

To test the correlation between NDF and Fisher, a two-input three-class problem was
constructed. Normal distributions were used to generate the inputs. Both inputs for each
class used the same normal distribution. Letting $N(\mu, \sigma)$ represent a normal distribution
with mean $\mu$ and standard deviation $\sigma$, the distributions used were $N(0, 1)$, $N(3, 1)$,
and $N(-3, 1)$ for classes one, two, and three. Given 1000 data sets, each with 100 points
randomly assigned to one of the three classes and then randomly drawn from that
class's distribution, the correlation coefficient between NDF and Fisher was found to be 0.278. If the
correlation coefficient between the NDF and Fisher were one, then NDF and Fisher could
be considered equivalent.

4.2  Unit  Selection  Tests 

Analysis of the 3-bit parity problem and the 3-bit symmetry problem has been performed
in order to demonstrate the feasibility of using the NDF to select units from a pool of
units available to a functional link. The algorithm considered simply selects from the unit
pool the unit which maximizes the criterion. For comparison purposes, CC S has also been
considered. For the test problems, the units in the unit pool are numbered as in Table 1,
where $i_j$ is input j.

4.2.1  3-Bit  Parity 

For  the  3-bit  parity  problem,  Table  2  shows  all  the  NDF  values  for  permutations  used  in 
the  possible  search  paths  for  determining  which  units  to  add.  Unit  4  would  be  the  first  unit 
added  because  it  gives  the  maximum  NDF  of  1.00  while  adding  any  other  unit  gives  a  zero 
NDF.  Given  Unit  4,  adding  any  of  the  remaining  units  gives  an  NDF  of  1.33  so  any  of  these 
could  be  used.  The  next  unit  selected  results  in  an  NDF  of  2.00  and  finally  adding  all  the 
units  gives  an  NDF  of  4.00.  At  this  point,  a  good  solution  is  obtained. 

Table 2: 3-Bit Parity Unit Selection by NDF

Used Units   Possible Unit   S Value
none         1               0.3750
none         2               0.6250
none         3               0.6250
none         4               0.6875
4            1               0.2500
4            2               1.0000
4            3               1.0000
4,2          1               0.5000
4,2          3               1.1666
4,3          1               0.5000
4,3          2               1.1666
4,2,3        1               0.5000

Table 3: 3-Bit Parity Unit Selection by S

Table 3 shows all the S values for permutations used in the search paths for determining
which units to add. The S order of selecting the units turned out to be Unit 4, then Unit 2 or 3,
then whichever of Units 2 and 3 was not yet picked, and finally Unit 1.

4.2.2  3-Bit Symmetry

For the 3-bit symmetry problem, Table 4 shows all the permutations for use of units and
their corresponding NDF value. Based on the NDF table, Unit 2 would be used first. The
network can learn the problem given only this node. Given Unit 2, adding any of the other
units does not increase the NDF.

Used Units   Possible Unit   NDF Value
none         1               0.00
none         2               4.00
none         3               0.00
none         4               0.00
2            1               4.00
2            3               4.00
2            4               4.00

Table 4: 3-Bit Symmetry Unit Selection by NDF

Used Units   Possible Unit   S Value
none         1               0.3750
none         2               2.3750
none         3               0.3750
none         4               1.1875
2            1
2            3
2            4

Table 5: 3-Bit Symmetry Unit Selection by S

Table 5 shows that S also directly selects Unit 2.

4.2.3  Discussion  of  Unit  Selection  Tests 

The unit selection tests show that the NDF can be used as a criterion for selecting units to
be used in a functional link. The NDF selects units which lead to good answers. The CC S
criterion appears to select units in a more restrictive fashion than the NDF criterion, but it
selects units which also maximize the NDF. In the 3-bit symmetry problem, the NDF stops
growing once the only needed unit is added. This may be an indicator that further node
selection is worthless. CC S does not have a similar heuristic at this point in the learning.


4.3  NDFC  Tests 


The  NDFC  approach  has  been  implemented.  Problems  were  run  in  order  to  test  NDFC  and 
allow  comparison  of  it  to  CC.  In  both  approaches,  conjugate  gradient[16]  is  used  to  train 
the  new  units  while  the  weights  into  the  linear  output  layer  are  calculated  using  the  singular 
value  decomposition  approach  to  solve  linear  least  square  problems[16].  The  cascade  units 
used  the  sigmoid  function; 


(10) 


Derivatives  are  calculated  using  numerical  methods  rather  than  by  direct  calculation  of 
formula. 


4.3.1  3-Bit Parity Test

Both methods were tested on the 3-bit parity problem. First, a pool size of 1 was used.
CC required 2 cascade units to learn the data. The NDF started at 0.0 with no new units,
increased to 1.0 with 1 unit, and finally increased to 4.0 with 2 units. The number of
incorrectly learned patterns started at 4, dropped to 1, and finally ended at 0. NDFC
showed no ability to learn, as the error count stayed at 4 after 3 cascade units were added
and the NDF remained a constant 0.0. Next, the pool size was increased to 10, where each
unit starts with different random weights. Both NDFC and CC required only a single new
unit to learn the 3-bit parity problem. For both methods, the number of incorrectly learned
patterns started at 4 and dropped to 0 with the new unit. For both methods, the NDF
starts at 0.0 with no new units and increases to 4.0 with the new unit.

4.3.2  4-Bit  Parity  Test 

Both methods were tested on the 4-bit parity problem. Each method used a pool size of
10. CC required 2 new units to learn the 4-bit parity problem. The number of incorrectly
learned patterns started at 8, dropped to 4, and finally ended at 0. The NDF started
at 0.0, increased to 1.6, and finally ended at 8.0. NDFC required 3 units. The number of
incorrectly learned patterns started at 8, dropped to 1, stayed at 1, and finally ended at 0.
The NDF started at 0.0, increased to 3.79, stayed at 3.79, and finally ended at 8.0. Cascade
Unit 2 had 0.0 output variance, which means the unit does nothing, so a check was added to
throw out any candidate units with variance less than 0.001. With this change, the second
unit took the NDF to 8.0 and reduced the number of incorrectly learned patterns to 0.
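
The variance check can be sketched as a simple filter over the candidate pool. The sketch below assumes the population variance of each candidate's outputs over the training patterns is compared against the limit; the text does not say whether the population or sample variance was used, and the function names are hypothetical.

```python
def output_variance(outputs):
    """Population variance of a candidate unit's outputs over all patterns."""
    mean = sum(outputs) / len(outputs)
    return sum((v - mean) ** 2 for v in outputs) / len(outputs)


def filter_candidates(candidates, min_variance=0.001):
    """Keep only candidates whose output variance reaches the limit.

    candidates: dict mapping a candidate name to its list of outputs.
    """
    return {
        name: outputs
        for name, outputs in candidates.items()
        if output_variance(outputs) >= min_variance
    }


# Hypothetical pool: one saturated unit and one unit that responds to the patterns.
candidates = {
    "saturated": [1.0] * 8,
    "responsive": [0.1, 0.9, 0.2, 0.8, 0.1, 0.9, 0.2, 0.8],
}
print(sorted(filter_candidates(candidates)))
```

A unit with a constant output adds a column that duplicates the bias node, so discarding it before selection avoids spending a cascade unit on no new information.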

4.3.3  Quadrant  Test 

Both  methods  were  tested  on  a  contrived  two  input  data  set.  Class  1  data  points  were  points 
which  had  both  inputs  in  the  range  0  to  5.  Otherwise  they  were  Class  2.  This  was  sampled 
using  a  grid  of  points,  11  across  each  axis  in  the  range  0  to  10.  Thus  Class  1  forms  the  lower 
left  quadrant  of  the  samples.  Using  a  pool  size  of  10,  CC  required  6  cascade  units  to  obtain 
the answer while NDFC required only 1. Figures 1 and 2 show the mappings obtained by
NDFC  as  learning  progressed  while  Figures  3,  4,  and  5  show  those  for  CC.  These  figures 
are  maps  of  the  classification  obtained  for  a  sample  grid  with  each  axis  having  64  points 
ranging  from  -1  to  11.  To  verify  this  performance,  the  test  was  repeated  using  Fahlman’s 
own  CC  code,  “cascor.”  Given  20  trials  with  a  pool  size  of  8,  the  smallest  network  had 
4  cascade  units.  Given  10  trials  with  a  pool  size  of  64,  the  smallest  network  again  had  4 
cascade  units.  Given  20  trials,  a  pool  size  of  8,  and  a  sigmoid  last  layer  rather  than  linear, 
the  smallest  network  had  2  cascade  units. 
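
The quadrant data set is simple enough to regenerate. The sketch below builds the 11 by 11 training grid and the 64 by 64 grid used for the classification maps. It assumes that the boundary value 5 belongs to Class 1, which the text does not state explicitly, and the function names are hypothetical.

```python
def quadrant_training_set():
    """11 x 11 grid on [0, 10]; Class 1 when both inputs are in 0..5 (5 assumed inclusive)."""
    samples = []
    for x in range(11):
        for y in range(11):
            label = 1 if (x <= 5 and y <= 5) else 2
            samples.append(((float(x), float(y)), label))
    return samples


def evaluation_grid(points_per_axis=64, low=-1.0, high=11.0):
    """Grid of points used to draw the classification maps."""
    step = (high - low) / (points_per_axis - 1)
    axis = [low + i * step for i in range(points_per_axis)]
    return [(x, y) for x in axis for y in axis]


training = quadrant_training_set()
class_one = sum(1 for _, label in training if label == 1)
grid = evaluation_grid()
print(len(training), class_one, len(training) - class_one, len(grid))
```

Under the inclusive-boundary assumption, Class 1 holds 36 of the 121 training points, so the two classes are unbalanced.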

Figure 1: NDFC Resulting Map for the Quadrant Test with No Cascade Unit.

Figure 2: NDFC Resulting Map for the Quadrant Test with One Cascade Unit.

Figure 3: CC Resulting Map for the Quadrant Test with No Cascade Unit.

Figure 4: CC Resulting Map for the Quadrant Test with One Cascade Unit.

Figure 5: CC Resulting Map for the Quadrant Test with Six Cascade Units.

4.3.4  Iris Test

Both methods were tested on Fisher's iris plant data [12, 17, 18]. This is perhaps the best
known database found in the pattern recognition literature. Each sample has the four
predictive attributes of sepal length, sepal width, petal length, and petal width for three
species of iris, which are Setosa, Versicolor, and Virginica. The data set has 50 samples of
each class for a total of 150 samples. The Setosa is linearly separable from the other two
classes while the Versicolor and Virginica are not linearly separable from each other. First,
a pool size of 10 was used. Figure 6 shows the effect of each cascade unit on the error
count using NDF and CC to choose units. Both methods required five new units to learn
the data correctly. Figure 7 shows the NDF for both methods as the new units are added.
Next, each method was given a pool size of 50. For NDFC, two different candidate unit
variance limit checks, $10^{-10}$ (zero) and 0.001, were tried. Figures 8 and 9 show the results.
CC and NDFC with variance limit $10^{-10}$ failed to learn the data after 5 units while NDFC
with variance limit 0.001 again required 5 units to learn the data.


4.3.5  Criteria Derivative Tests

To test whether or not NDFC and CC agree on units created, criteria derivative tests were
performed using the 3-bit parity data. Using the initial data, candidate units were trained
based on the NDFC and CC S criteria. Each method was given 10 tries to maximize its criterion.
The candidate units created were then given as initial conditions to the opposing method.
In both cases, the initial cost derivative was found to be zero, so no changes were made.


4.3.6  Discussion  of  NDFC  Tests 

The NDFC test results demonstrate that NDFC can be used to create cascade networks
which reach good solutions. The pool size impacts the performance of both algorithms.
In the 3-bit parity test, increasing the pool size helped both algorithms find a solution
requiring only one unit where before CC required two units and NDFC did not work at
all. NDFC found a smaller network in the quadrant case. In the case of the iris data, CC
performed oddly in that a pool size of 10 took only 5 units while a pool size of 50 still did
not give a good solution after 9 units. As the larger pool size case included all the same
starting weights as the smaller pool size, this pool size effect suggests that CC may be a
greedy algorithm. The success of CC and NDFC with a pool size of 10 showed that a good
solution given 5 units exists, yet CC failed to find one even given a pool size of 50. Putting
a lower limit on the candidate unit output variance proved to be useful for the 4-bit parity
and the iris test in that it increased NDFC performance. Choice of the limit needs further
study. In particular, knowing a good value for the limit would avoid useless runs. The
criteria derivative tests show that the NDF and CC S share common zero derivative points.
This suggests that there is some agreement between them as to which weights to select.

Figure 6: Number of Cascade Units vs. Error Count (Pool Size 10)

Figure 7: Number of Cascade Units vs. NDF (Pool Size 10)

Figure 8: Number of Cascade Units vs. Error Count (Pool Size 50)

Figure 9: Number of Cascade Units vs. NDF (Pool Size 50)

NDF Weight   2-bit Symmetry (2 2 2)   4-bit Symmetry (4 3 2)
0.0          1                        1
1.0          7                        7
8.0          18                       45
20.0         24                       63
100.0        25                       41

Table 6: Impact of NDF on MLP Learning. Number of Good Solutions when Initial Weights
are U(-0.2, 0.2)

NDF Weight   2-bit Symmetry (2 2 2)
0.0          157
1.0          23
8.0          58
20.0         48

Table 7: Impact of NDF on MLP Learning. Number of Good Solutions when Initial Weights
are U(-0.6, 0.6)

4.4  NDF  MLP  Tests 

To see whether adding the NDF to the cost used to optimize an MLP helps learning, tests
were run using different values of the NDF cost weight, K. Conjugate gradient was used to
train the MLPs. Hidden layer nodes were sigmoid function units, while the last layer used
linear nodes. Weights were randomly selected from a uniform distribution of values between
plus and minus 0.2 (U(-0.2,0.2)). Given the same 500 sets of initial random weights, the
number of good solutions was counted. Table 6 shows the results for two problems.
The MLP dimensions used for each problem are given in parentheses, e.g., (4 3 2) means that
the 4-bit symmetry problem used a net with a 4-node input layer, a 3-node hidden layer,
and a 2-node output layer. Table 7 shows the impact of increasing the weight distribution
range to U(-0.6,0.6) on the 2-bit symmetry problem.
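One way to reuse the same 500 starting points for every value of K is to generate them once from a seeded generator. The sketch below assumes this setup; the function names, the seed, and the extra bias weight per node are illustrative assumptions, since the text does not say how the weight sets were stored or whether biases were used.

```python
import random


def init_weights(dims, half_range, rng):
    """Draw one weight set for a layered net, e.g. dims=(4, 3, 2).

    Each layer gets an n_out x (n_in + 1) matrix of U(-half_range, half_range)
    values; the extra column is a bias weight (an assumption, not from the text).
    """
    return [
        [[rng.uniform(-half_range, half_range) for _ in range(n_in + 1)]
         for _ in range(n_out)]
        for n_in, n_out in zip(dims[:-1], dims[1:])
    ]


def make_weight_sets(dims, half_range, n_sets=500, seed=0):
    """Build n_sets reproducible starting points to reuse for every K."""
    rng = random.Random(seed)
    return [init_weights(dims, half_range, rng) for _ in range(n_sets)]


# 4-bit symmetry net (4 3 2) with initial weights drawn from U(-0.2, 0.2).
weight_sets = make_weight_sets((4, 3, 2), half_range=0.2)
print(len(weight_sets), len(weight_sets[0][0]), len(weight_sets[0][0][0]))
```

Fixing the seed means that any difference in the number of good solutions between two values of K comes from the cost function and not from a different draw of starting weights.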

4.6  Discussion  of  NDF  MLP  Tests 

The tests show that using the NDF in the cost increases the odds of getting a good solution
only if poor initial conditions are used. In fact, using it with good initial weights appears
to be detrimental. This makes it of limited value given the calculation overhead and the
fact that it is easy to adjust the initial weight range. If finding good initial weights proves
hard, then using the NDF could be useful.
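To make the comparison concrete, the counts in Tables 6 and 7 can be turned into success rates. The sketch below uses the counts as read from the tables and assumes that the 500 runs stated for Table 6 also apply to Table 7, which the text does not say explicitly.

```python
N_RUNS = 500  # stated for Table 6; assumed here to apply to Table 7 as well

# Good-solution counts keyed by NDF cost weight K.
table6_2bit = {0.0: 1, 1.0: 7, 8.0: 18, 20.0: 24, 100.0: 25}   # U(-0.2, 0.2)
table6_4bit = {0.0: 1, 1.0: 7, 8.0: 45, 20.0: 63, 100.0: 41}   # U(-0.2, 0.2)
table7_2bit = {0.0: 157, 1.0: 23, 8.0: 58, 20.0: 48}           # U(-0.6, 0.6)


def success_rates(counts, n_runs=N_RUNS):
    """Convert good-solution counts into fractions of the runs."""
    return {k: c / n_runs for k, c in counts.items()}


for name, counts in [
    ("Table 6, 2-bit symmetry", table6_2bit),
    ("Table 6, 4-bit symmetry", table6_4bit),
    ("Table 7, 2-bit symmetry", table7_2bit),
]:
    rates = success_rates(counts)
    print(name, {k: f"{100 * r:.1f}%" for k, r in rates.items()})
```

Under that assumption, the 2-bit symmetry problem without the NDF term goes from 1 good solution in 500 with the narrow range to 157 in 500 with the wider range, a larger gain than any NDF weight achieves in Table 6.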


5  Discussion 

The NDF is useful for training functional link networks, including cascade networks. The
success of the unit selection and NDFC tests demonstrates this. These tests suggest that
a close tie exists between the NDF and the CC error correlation value, S. The unit selection
tests suggest that a search based on S forms a more restrictive search space than an
NDF-based search, but it also picks points which maximize the NDF. The NDFC tests showed
that CC also maximizes the NDF. The NDFC tests also showed that NDFC can find smaller
networks than CC. The criteria derivative tests demonstrate that, for an MLP, the NDF and S
share common zero derivative points, so that their searches may lead to the same weights
being selected.
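For reference, the error correlation S used by Cascade-Correlation is the sum, over output units, of the magnitude of the covariance between a candidate unit's output and the residual error at that output. A minimal sketch of that calculation follows; the function and variable names are illustrative, and the example values are made up.

```python
def error_correlation(v, errors):
    """Cascade-Correlation candidate score S.

    v:      candidate unit outputs, one per training pattern
    errors: residual errors, errors[p][o] for pattern p and output unit o
    """
    n = len(v)
    v_mean = sum(v) / n
    s = 0.0
    for o in range(len(errors[0])):
        e_mean = sum(row[o] for row in errors) / n
        # Unnormalized covariance between the candidate and this output's error.
        cov = sum((v[p] - v_mean) * (errors[p][o] - e_mean) for p in range(n))
        s += abs(cov)
    return s


# Made-up example: the candidate tracks the error on both outputs.
v = [0.0, 1.0, 0.0, 1.0]
errors = [[-0.5, 0.5], [0.5, -0.5], [-0.5, 0.5], [0.5, -0.5]]
print(error_correlation(v, errors))
```

A candidate whose output is constant across patterns scores zero under S, which is consistent with the role the output variance limit plays in NDFC.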

Although NDFC and NDF MLP performed well for some problems, the NDF is a costly
calculation compared to the SSE and the error correlation calculations. The NDF does
not appear to have a straightforward derivative calculation like the SSE and the error
correlation. Run-time trade-offs need to be examined. It may be fruitful to use the NDF
cost initially or intermittently to aid the learning process. Also, the anomaly in the NDF
calculation needs further examination to determine the cause and possibly a solution.
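Using the NDF cost only initially or intermittently could be expressed as a schedule for the cost weight K. The following is a hypothetical sketch of such a schedule, not something that was tested here; the parameter names `warmup_epochs` and `period` are assumptions introduced for illustration.

```python
def ndf_weight(epoch, k, warmup_epochs=0, period=0):
    """Return the NDF cost weight to apply at a given epoch.

    - epochs before warmup_epochs use k (NDF applied initially)
    - after that, every period-th epoch uses k (NDF applied off and on)
    - all other epochs use 0.0, so the costly NDF term can be skipped
    """
    if epoch < warmup_epochs:
        return k
    if period > 0 and epoch % period == 0:
        return k
    return 0.0


# K = 8.0 for the first 3 epochs, then on every 5th epoch.
schedule = [ndf_weight(e, 8.0, warmup_epochs=3, period=5) for e in range(12)]
print(schedule)
```

Epochs where the schedule returns zero would not need the NDF calculation at all, which is where any run-time saving would come from.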

The NDF calculation needs further research. The following questions need to be
addressed:

• In cascade network building, does maximizing error correlation theoretically imply
maximizing the NDF?

• Can the NDF be applied initially or intermittently to aid learning and/or architecture
selection?

• Can the NDF be used to create another discriminant function which is less cost
intensive?

• Do NDF-related characteristics give early indication that training is not being
successful?

• In NDFC, can an acceptable candidate unit output variance lower limit be estimated?

• In CC, can back-track searches be applied to find smaller networks, and can heuristics
be found to speed the training time?

6 Conclusions


The utility of the NDF as a criterion for architecture selection has been established.
The NDF has successfully been used to select units for functional link networks, including
cascade-type networks. The NDF has some, although limited, use in increasing the odds of
finding good solutions in MLPs. Trade-off studies still need to be performed to establish
breakpoints in the usefulness of the NDF calculation versus others which are cheaper
to compute. Questions for further research have been identified.